========================================================
by Saravanan Natarajan
========================================================
The annual summary file from the Environmental Protection Agency’s Air Quality System for 2017 contains several data sets related to pollution. Ground-level ozone and airborne particles are the two pollutants that pose the greatest threat to human health. This project explores the good days and the days unhealthy for sensitive groups in 2017 across the states of the USA.
Load the data from the CSV file and explore it.
## 'data.frame': 1044 obs. of 19 variables:
## $ State : Factor w/ 54 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County : Factor w/ 809 levels "Abbeville","Ada",..: 43 166 178 210 249 253 340 361 408 445 ...
## $ Year : int 2017 2017 2017 2017 2017 2017 2017 2017 2017 2017 ...
## $ Days.with.AQI : int 135 58 140 241 105 178 140 243 11 167 ...
## $ Good.Days : int 111 49 128 219 100 118 125 103 11 143 ...
## $ Moderate.Days : int 23 9 12 22 5 58 14 134 0 24 ...
## $ Unhealthy.for.Sensitive.Groups.Days: int 1 0 0 0 0 1 1 6 0 0 ...
## $ Unhealthy.Days : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Very.Unhealthy.Days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hazardous.Days : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Max.AQI : int 108 66 63 80 58 163 131 129 38 100 ...
## $ X90th.Percentile.AQI : int 61 54 50 50 48 62 51 76 30 58 ...
## $ Median.AQI : int 41 27 41 40 38 45 38 53 18 42 ...
## $ Days.CO : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Days.NO2 : int 0 0 0 0 0 0 0 2 0 0 ...
## $ Days.Ozone : int 104 0 110 224 105 73 106 73 0 118 ...
## $ Days.SO2 : int 0 0 0 0 0 0 0 45 0 0 ...
## $ Days.PM2.5 : int 31 58 30 17 0 105 34 123 11 22 ...
## $ Days.PM10 : int 0 0 0 0 0 0 0 0 0 27 ...
The data set consists of 1044 observations with 19 variables for the year 2017.
Complete details about the dataset are provided at https://www.epa.gov/outdoor-air-quality-data/about-air-data-reports#aqi
Each row of the AQI Report lists summary values for one year for one county.
https://www.airnow.gov/index.cfm?action=aqibasics.aqi
The summary values include both qualitative measures (days of the year having “good” air quality, for example) and descriptive statistics (median AQI value, for example).
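The loading and exploration step can be sketched as follows. This is a minimal sketch, not the original code: the real file name is not given here, so a small inline sample stands in for it, and `aqi` and `csv_text` are illustrative names. It assumes the CSV headers contain spaces, which `read.csv()` converts to dots.

```r
# Inline sample standing in for the real CSV (its file name is not given here).
# The three rows are the first Alabama counties shown in the str() output.
csv_text <- "State,County,Year,Days with AQI,Good Days,Moderate Days
Alabama,Baldwin,2017,135,111,23
Alabama,Clay,2017,58,49,9
Alabama,Colbert,2017,140,128,12"

# read.csv() rewrites header names with make.names() (check.names = TRUE by
# default), so "Days with AQI" becomes Days.with.AQI.
# stringsAsFactors = TRUE reproduces the Factor columns on R 4.0 and later.
aqi <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(aqi)      # column types and the first few values
summary(aqi)  # per-column summaries
names(aqi)    # column names
```

With the real file, `text = csv_text` would be replaced by the path to the downloaded CSV.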
## State County Year Days.with.AQI
## California : 50 Washington: 13 Min. :2017 Min. : 5.0
## Texas : 47 Jefferson : 12 1st Qu.:2017 1st Qu.:178.0
## Ohio : 41 Franklin : 8 Median :2017 Median :212.0
## Indiana : 40 Jackson : 8 Mean :2017 Mean :202.7
## Florida : 39 Lake : 8 3rd Qu.:2017 3rd Qu.:270.0
## North Carolina: 38 Montgomery: 8 Max. :2017 Max. :304.0
## (Other) :789 (Other) :987
## Good.Days Moderate.Days Unhealthy.for.Sensitive.Groups.Days
## Min. : 5.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:131.0 1st Qu.: 12.00 1st Qu.: 0.000
## Median :168.5 Median : 27.00 Median : 0.000
## Mean :164.4 Mean : 35.60 Mean : 2.247
## 3rd Qu.:209.2 3rd Qu.: 50.25 3rd Qu.: 2.000
## Max. :302.0 Max. :178.00 Max. :90.000
##
## Unhealthy.Days Very.Unhealthy.Days Hazardous.Days
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.00000
## Median : 0.0000 Median : 0.00000 Median : 0.00000
## Mean : 0.4138 Mean : 0.04023 Mean : 0.02107
## 3rd Qu.: 0.0000 3rd Qu.: 0.00000 3rd Qu.: 0.00000
## Max. :106.0000 Max. :15.00000 Max. :10.00000
##
## Max.AQI X90th.Percentile.AQI Median.AQI Days.CO
## Min. : 1.0 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 77.0 1st Qu.: 48.00 1st Qu.: 33.00 1st Qu.: 0.000
## Median : 97.0 Median : 54.00 Median : 39.00 Median : 0.000
## Mean : 105.7 Mean : 55.08 Mean : 35.57 Mean : 0.274
## 3rd Qu.: 119.0 3rd Qu.: 62.00 3rd Qu.: 42.00 3rd Qu.: 0.000
## Max. :3439.0 Max. :208.00 Max. :135.00 Max. :162.000
##
## Days.NO2 Days.Ozone Days.SO2 Days.PM2.5
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 0 Median :122.0 Median : 0.0 Median : 44.0
## Mean : 3 Mean :121.1 Mean : 10.6 Mean : 59.7
## 3rd Qu.: 0 3rd Qu.:192.0 3rd Qu.: 0.0 3rd Qu.: 91.0
## Max. :190 Max. :304.0 Max. :304.0 Max. :302.0
##
## Days.PM10
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 8.006
## 3rd Qu.: 0.000
## Max. :284.000
##
## [1] "State"
## [2] "County"
## [3] "Year"
## [4] "Days.with.AQI"
## [5] "Good.Days"
## [6] "Moderate.Days"
## [7] "Unhealthy.for.Sensitive.Groups.Days"
## [8] "Unhealthy.Days"
## [9] "Very.Unhealthy.Days"
## [10] "Hazardous.Days"
## [11] "Max.AQI"
## [12] "X90th.Percentile.AQI"
## [13] "Median.AQI"
## [14] "Days.CO"
## [15] "Days.NO2"
## [16] "Days.Ozone"
## [17] "Days.SO2"
## [18] "Days.PM2.5"
## [19] "Days.PM10"
The summary covers the states and their counties for the year 2017 across all 19 variables.
Histogram plots were used to view the various AQI categories, their counts, and their ranges. The bin width and color were chosen based on the number of days in the year and the AQI values of those days.
A daily index value is calculated for each air pollutant measured. The highest of those index values is the AQI value, and the pollutant responsible for the highest index value is the “Main Pollutant.” These columns give the number of days each pollutant measured was the main pollutant. A blank column indicates a pollutant not measured in the county or CBSA.
The graph indicates that ozone and PM2.5 were most often the main pollutant. CO was the main pollutant least often.
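The main-pollutant rule described above can be illustrated with a small sketch. The daily index values are made up for illustration, and `daily`, `aqi_value`, and `main_pollutant` are hypothetical names, not part of the EPA data.

```r
# Hypothetical daily index values for one county (illustrative numbers only).
daily <- data.frame(
  Ozone = c(42, 61, 35),
  PM2.5 = c(55, 48, 30),
  CO    = c(NA, 5, 3),   # NA: pollutant not measured that day
  check.names = FALSE
)

# The AQI for a day is the highest index value across the measured pollutants.
aqi_value <- apply(daily, 1, max, na.rm = TRUE)

# The main pollutant is the one responsible for that highest value.
main_pollutant <- names(daily)[apply(daily, 1, which.max)]

# Counting days per main pollutant mirrors the Days.Ozone, Days.PM2.5, ... columns.
table(main_pollutant)
```

In this toy example ozone is the main pollutant on two days and PM2.5 on one, so the corresponding day counts would be 2 and 1.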
Which counties in the United States had the highest levels of ozone pollution in 2017?
Plot Days.with.AQI against State to get the data directly. Days.with.AQI, the number of days in the year on which an AQI value was reported, is used as the ranking variable.
Create a variable called state_ranking holding the log10 of Days.with.AQI.
## 'data.frame': 1044 obs. of 3 variables:
## $ State : Factor w/ 54 levels "Alabama","Alaska",..: 5 5 5 6 14 17 17 17 17 17 ...
## $ County : Factor w/ 809 levels "Abbeville","Ada",..: 274 385 610 395 46 172 424 504 553 580 ...
## $ air_quality_data_log10: num 2.48 2.48 2.48 2.48 2.48 ...
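A minimal sketch of the ranking step, assuming a plain base R approach. The Days.with.AQI values in the sample are back-computed from the log10 values in the ranking lists that follow (for example, 10^2.482874 is 304), and `aqi` is an illustrative name.

```r
# Sample rows; Days.with.AQI is back-computed from the log10 values
# shown in the ranking lists (assumption, not read from the raw file).
aqi <- data.frame(
  State  = c("Montana", "California", "Oregon", "Iowa", "Idaho"),
  County = c("Roosevelt", "Fresno", "Clackamas", "Linn", "Custer"),
  Days.with.AQI = c(5, 304, 10, 304, 8)
)

# Keep the identifying columns and add the log10-transformed value.
state_ranking <- data.frame(
  State  = aqi$State,
  County = aqi$County,
  air_quality_data_log10 = log10(aqi$Days.with.AQI)
)

# Sort from the highest to the lowest value.
state_ranking <- state_ranking[order(-state_ranking$air_quality_data_log10), ]

head(state_ranking, 2)  # top of the ranking
tail(state_ranking, 2)  # bottom of the ranking
```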
List of the most polluted states and counties
## State County air_quality_data_log10
## 1 California Fresno 2.482874
## 2 California Kings 2.482874
## 3 California Riverside 2.482874
## 4 Colorado La Plata 2.482874
## 5 Idaho Bannock 2.482874
## 6 Iowa Clinton 2.482874
## 7 Iowa Linn 2.482874
## 8 Iowa Muscatine 2.482874
## 9 Iowa Palo Alto 2.482874
## 10 Iowa Polk 2.482874
List of the least polluted states and counties
## State County air_quality_data_log10
## 1035 Oregon Clackamas 1.0000000
## 1036 Utah Garfield 1.0000000
## 1037 Washington Klickitat 1.0000000
## 1038 California Del Norte 0.9542425
## 1039 Montana Powell 0.9542425
## 1040 Nebraska Thomas 0.9542425
## 1041 Oregon Wallowa 0.9542425
## 1042 South Carolina Abbeville 0.9542425
## 1043 Idaho Custer 0.9030900
## 1044 Montana Roosevelt 0.6989700
Interestingly, California appears in both the most polluted and the least polluted lists. Del Norte County in California is among the least polluted, while Fresno, Kings, and Riverside counties are among the most polluted.
The dataset does not include the share of good days and sensitive-group days in 2017. These are computed as mean_Good_AQI and mean_Sensitive_AQI by dividing Good.Days and Unhealthy.for.Sensitive.Groups.Days by Days.with.AQI.
## # A tibble: 20 x 5
## # Groups: State, County [20]
## State County Days.with.AQI mean_Good_AQI mean_Sensitive_A~
## <fct> <fct> <int> <dbl> <dbl>
## 1 Alabama Baldwin 135 0.822 0.00741
## 2 Alabama Clay 58 0.845 0.
## 3 Alabama Colbert 140 0.914 0.
## 4 Alabama DeKalb 241 0.909 0.
## 5 Alabama Elmore 105 0.952 0.
## 6 Alabama Etowah 178 0.663 0.00562
## 7 Alabama Houston 140 0.893 0.00714
## 8 Alabama Jefferson 243 0.424 0.0247
## 9 Alabama Lawrence 11 1.00 0.
## 10 Alabama Madison 167 0.856 0.
## 11 Alabama Mobile 181 0.768 0.0276
## 12 Alabama Montgomery 178 0.708 0.0112
## 13 Alabama Morgan 181 0.851 0.
## 14 Alabama Russell 133 0.842 0.
## 15 Alabama Shelby 263 0.954 0.
## 16 Alabama Sumter 170 0.924 0.
## 17 Alabama Talladega 60 0.783 0.
## 18 Alabama Tuscaloosa 171 0.614 0.00585
## 19 Alaska "Aleutians East " 11 1.00 0.
## 20 Alaska "Anchorage " 181 0.818 0.00552
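The two derived columns can be reproduced with a short sketch. The tibble output suggests the original used dplyr; base R is used here only to keep the example dependency-free, and the three sample rows are taken from the str() output earlier in the report.

```r
# Three Alabama counties with their raw day counts.
aqi <- data.frame(
  State  = c("Alabama", "Alabama", "Alabama"),
  County = c("Baldwin", "Clay", "Jefferson"),
  Days.with.AQI = c(135, 58, 243),
  Good.Days     = c(111, 49, 103),
  Unhealthy.for.Sensitive.Groups.Days = c(1, 0, 6)
)

# Share of days with an AQI value that fell in each category.
aqi$mean_Good_AQI      <- aqi$Good.Days / aqi$Days.with.AQI
aqi$mean_Sensitive_AQI <- aqi$Unhealthy.for.Sensitive.Groups.Days / aqi$Days.with.AQI

print(aqi[, c("County", "mean_Good_AQI", "mean_Sensitive_AQI")], digits = 3)
```

Baldwin, for instance, has 111 good days out of 135, which gives the 0.822 shown in the table.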
The values can be plotted against the number of days in the year having an AQI value.
The plot reveals the outliers in both the good-days share and the sensitive-days share.
A scatterplot helps us view the number of sensitive-group days against good days for each state.
Returning to the least and most polluted states, this section plots the data on a US map. The main challenge is that the EPA data does not include latitude and longitude for the states. The approach used here comes from:
https://ekburchfield.files.wordpress.com/2017/05/web_scraping_in_r.pdf
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/GeographyOfAmericanMusic.html
## state.abb state.name state.area x y
## Alabama AL Alabama 51609 -86.7509 32.5901
## Arizona AZ Arizona 113909 -111.6250 34.2192
## Arkansas AR Arkansas 53104 -92.2992 34.7336
## California CA California 158693 -119.7730 36.5341
## Colorado CO Colorado 104247 -105.5130 38.6777
## state.division state.region Population Income Illiteracy
## Alabama East South Central South 3615 3624 2.1
## Arizona Mountain West 2212 4530 1.8
## Arkansas West South Central South 2110 3378 1.9
## California Pacific West 21198 5114 1.1
## Colorado Mountain West 2541 4884 0.7
## Life.Exp Murder HS.Grad Frost Area
## Alabama 69.05 15.1 41.3 20 50708
## Arizona 70.55 7.8 58.1 15 113417
## Arkansas 70.66 10.1 39.9 65 51945
## California 71.71 10.3 62.6 20 156361
## Colorado 72.06 6.8 63.9 166 103766
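A state lookup table like the one printed above can be assembled from data sets that ship with R. This is a sketch under that assumption: the column layout matches the output, but the exact call used in the original is not shown, and `states` is an illustrative name.

```r
# state.abb, state.name, state.area, state.center, state.division,
# state.region and state.x77 are all part of R's built-in datasets package.
states <- data.frame(
  state.abb, state.name, state.area,
  x = state.center$x,   # longitude of the state's approximate centre
  y = state.center$y,   # latitude of the state's approximate centre
  state.division, state.region,
  state.x77,            # Population, Income, Illiteracy, ... (1970s figures)
  row.names = state.name
)

head(states[, c("state.abb", "state.area", "x", "y")])
```

The built-in tables cover all 50 states, so any state missing from the map data would need to be filtered out separately.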
## long lat group order region subregion State County
## 1 -86.50517 32.34920 1 1 alabama autauga Alabama Autauga
## 2 -86.53382 32.35493 1 2 alabama autauga Alabama Autauga
## 3 -86.54527 32.36639 1 3 alabama autauga Alabama Autauga
## 4 -86.55673 32.37785 1 4 alabama autauga Alabama Autauga
## 5 -86.57966 32.38357 1 5 alabama autauga Alabama Autauga
## 6 -86.59111 32.37785 1 6 alabama autauga Alabama Autauga
## State County Days.with.AQI mean_Good_AQI mean_Sensitive_AQI
## 102 Alabama Baldwin 135 0.8222222 0.007407407
## 109 Alabama Baldwin 135 0.8222222 0.007407407
## 40 Alabama Baldwin 135 0.8222222 0.007407407
## 95 Alabama Baldwin 135 0.8222222 0.007407407
## 81 Alabama Baldwin 135 0.8222222 0.007407407
## 53 Alabama Baldwin 135 0.8222222 0.007407407
## 66 Alabama Baldwin 135 0.8222222 0.007407407
## 104 Alabama Baldwin 135 0.8222222 0.007407407
## 54 Alabama Baldwin 135 0.8222222 0.007407407
## 79 Alabama Baldwin 135 0.8222222 0.007407407
## 64 Alabama Baldwin 135 0.8222222 0.007407407
## 78 Alabama Baldwin 135 0.8222222 0.007407407
## 89 Alabama Baldwin 135 0.8222222 0.007407407
## 65 Alabama Baldwin 135 0.8222222 0.007407407
## 67 Alabama Baldwin 135 0.8222222 0.007407407
## 92 Alabama Baldwin 135 0.8222222 0.007407407
## 80 Alabama Baldwin 135 0.8222222 0.007407407
## 98 Alabama Baldwin 135 0.8222222 0.007407407
## 96 Alabama Baldwin 135 0.8222222 0.007407407
## 103 Alabama Baldwin 135 0.8222222 0.007407407
## long lat group order region subregion
## 102 -87.93757 31.14599 2 53 alabama baldwin
## 109 -87.93183 31.15171 2 54 alabama baldwin
## 40 -87.92037 31.15171 2 55 alabama baldwin
## 95 -87.90318 31.16318 2 56 alabama baldwin
## 81 -87.89173 31.16318 2 57 alabama baldwin
## 53 -87.88026 31.15171 2 58 alabama baldwin
## 66 -87.87453 31.16318 2 59 alabama baldwin
## 104 -87.87453 31.18036 2 60 alabama baldwin
## 54 -87.86308 31.18036 2 61 alabama baldwin
## 79 -87.85735 31.18609 2 62 alabama baldwin
## 64 -87.86308 31.19182 2 63 alabama baldwin
## 78 -87.87453 31.19755 2 64 alabama baldwin
## 89 -87.86880 31.20901 2 65 alabama baldwin
## 65 -87.84016 31.20901 2 66 alabama baldwin
## 67 -87.83443 31.22047 2 67 alabama baldwin
## 92 -87.85162 31.24339 2 68 alabama baldwin
## 80 -87.83443 31.24339 2 69 alabama baldwin
## 98 -87.82870 31.24912 2 70 alabama baldwin
## 96 -87.82870 31.26058 2 71 alabama baldwin
## 103 -87.81724 31.27777 2 72 alabama baldwin
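The join between the AQI table and the county boundaries can be sketched as below. The toy `county_map` mimics the shape returned by `ggplot2::map_data("county")` so that the example runs without extra packages, and `cap_first` is a hypothetical helper, not a function from the original analysis.

```r
# Toy county outline: lower-case region/subregion plus boundary points.
# The two points are the first Baldwin rows printed above.
county_map <- data.frame(
  long      = c(-87.93183, -87.93757),
  lat       = c(31.15171, 31.14599),
  group     = c(2, 2),
  order     = c(54, 53),
  region    = c("alabama", "alabama"),
  subregion = c("baldwin", "baldwin")
)

# Hypothetical helper: capitalise the first letter so names match the EPA data.
# Multi-word names (e.g. "new york") would need extra handling.
cap_first <- function(x) paste0(toupper(substring(x, 1, 1)), substring(x, 2))
county_map$State  <- cap_first(county_map$region)
county_map$County <- cap_first(county_map$subregion)

aqi <- data.frame(State = "Alabama", County = "Baldwin",
                  Days.with.AQI = 135, mean_Good_AQI = 111 / 135)

# Join on the shared keys, then restore the drawing order of the boundary,
# since merge() does not preserve it.
merged <- merge(aqi, county_map, by = c("State", "County"))
merged <- merged[order(merged$order), ]
merged
```

Re-sorting by `order` matters for plotting: polygons drawn from unsorted boundary points come out scrambled.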
The county boundaries are plotted with the state names. The most polluted regions appear darker, which indicates a lower percentage of good days in the year. The color scale is shown at the bottom.
Finish by computing the correlation between good AQI days and sensitive AQI days.
## [1] "-0.66"
The correlation coefficient is r = -0.66.
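A minimal sketch of the correlation step, using only the first ten Alabama rows from the str() output. Because this is a small sample, it will not reproduce the -0.66 obtained on all 1044 counties; it only shows the calculation and the rounding to a two-decimal string.

```r
# First ten rows of the dataset, as shown in the str() output.
days_with_aqi  <- c(135, 58, 140, 241, 105, 178, 140, 243, 11, 167)
good_days      <- c(111, 49, 128, 219, 100, 118, 125, 103, 11, 143)
sensitive_days <- c(1, 0, 0, 0, 0, 1, 1, 6, 0, 0)

# Shares of good and sensitive-group days per county.
mean_Good_AQI      <- good_days / days_with_aqi
mean_Sensitive_AQI <- sensitive_days / days_with_aqi

# Pearson correlation (the default method of cor()).
r <- cor(mean_Good_AQI, mean_Sensitive_AQI)

# Format as a two-decimal string, like the printed result above.
format(round(r, 2), nsmall = 2)
```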
The number of days in the year with an AQI value of 0 through 50 gives the good days for each county. Comparing good-day counts across counties helps identify the healthiest states in the USA. Florida and District Of Columbia were chosen to plot the good AQI days for each county.
http://paulorenato.com/index.php/171
The bin width and the scales argument of facet_wrap helped give a clear view of good AQI days in each county.
In the graph above, Florida reaches a maximum count of 4 on the y axis, meaning 4 counties had 131 good AQI days: Miami-Dade, Palm Beach, Pinellas, and Sarasota. The same applies to 149 good AQI days.
District Of Columbia has only one county, with 157 good AQI days in 2017.
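The bar heights read off the histogram can be checked with a simple tally. This sketch uses only the counties and values named above; the county name for District Of Columbia is an assumption, and `good` and `counts` are illustrative names.

```r
# Counties and good-day counts named in the text.
good <- data.frame(
  State  = c(rep("Florida", 4), "District Of Columbia"),
  County = c("Miami-Dade", "Palm Beach", "Pinellas", "Sarasota",
             "District of Columbia"),  # assumed county name for DC
  Good.Days = c(131, 131, 131, 131, 157)
)

# With a bin width of 1, each histogram bar is the number of counties
# sharing a Good.Days value; table() gives the same counts per state.
counts <- table(good$State, good$Good.Days)
counts
```

The Florida row shows 4 under 131, matching the maximum on the y axis of the Florida panel.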
For a bird’s-eye view of pollution in each state, return to the map plot. To get a better picture of pollution in each state, also consider the days unhealthy for sensitive groups, with AQI levels of 101 to 150.
Explore further based on state area
## state.abb state.name state.area
## 1 AZ Arizona 113909
## 2 CA California 158693
## 3 CO Colorado 104247
## 4 MT Montana 147138
## 5 NV Nevada 110540
## 6 NM New Mexico 121666
The green dots mark the largest states in the US.
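Selecting the largest states can be sketched with R's built-in state tables. The 100,000 square mile cut-off is an assumption inferred from the areas in the table above, not a value stated in the original, and `big_states` is an illustrative name.

```r
# Built-in state tables from R's datasets package.
states <- data.frame(state.abb, state.name, state.area)

# Assumed cut-off: every state listed above has an area over 100,000.
big_states <- subset(states, state.area > 100000)

# Largest first.
big_states[order(-big_states$state.area), ]
```

Because the built-in tables cover all 50 states, this selection can also return states that do not appear in the table above.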
At the start of the project, the goal was to plot the pollution levels of the various states in the USA. After exploring the variables available in the dataset, the direction changed to finding the least polluted and most polluted states.
Even the simple univariate analysis required a lot of thought about how to handle the data.
The next interesting finding is the correlation between sensitive AQI days and good AQI days, which showed a moderately strong negative linear relationship (r = -0.66).
After further exploring the number of good AQI days (values 0 through 50) and the number of days unhealthy for sensitive groups (values 101 through 150), it was interesting to plot the data on a US map to get a complete view of pollution in each county.
There is plenty of room to explore the same data set further: the relationship between ozone and PM2.5, which pollutant contributes most in each state, and how the population of each state affects pollution and vice versa.